-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add visitor for converting kernel expressions to engine expressions #363
Add visitor for converting kernel expressions to engine expressions #363
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #363 +/- ##
==========================================
- Coverage 78.94% 77.25% -1.70%
==========================================
Files 52 54 +2
Lines 10756 10994 +238
Branches 10756 10994 +238
==========================================
+ Hits 8491 8493 +2
- Misses 1796 2032 +236
Partials 469 469 ☔ View full report in Codecov by Sentry. |
FFI code for schema and expressions have been moved to their own modules. Add prototype of the visitor to enable kernel expresison -> engine expression
1814c84
to
57fdf77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Pretty much there, just a couple of small things.
} | ||
call!(visitor, visit_array_literal, sibling_list_id, child_list_id); | ||
} | ||
fn visit_struct_literal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hrmm, this is a tricky one. I don't love converting the field names into string scalars and maintaining two lists.
We already have a way of doing iterators, so what if we visit all the values, and then visit_struct_literal
could take an iterator over KernelStringSlice
that will return field names, and the list id where the engine can find the value for each field.
I don't love that solution either, but it feels more clean than the engine having to unwrap all these string scalars into just strings. The downside is that you're pairing an iterator to a list, so the engine has to deal with potentially two different ways of iterating.
I realize this might all change since we're potentially doing away with/seriously changing StructData
, so if you think it's not worth tackling until then I'm okay with this in the short term.
also aside: can we rename this internal function so it doesn't have the same name as the visitor one, it's confusing :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was intended as a quick placeholder from the beginning because StructField
would've been better to pass. On top of that we're planning a major rework of scalars/structs/arrays, so I think this is okay as a first draft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed kernel visitor functions to visit_expression_*
👍
Also, I think the coverage is listed as so low here because the testing happens in Fine to take that as a follow-up. Please create an issue if that's the plan. |
Yup looks like it - created #425 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok whew that was a lot haha - haven't reviewed C in a while.. LGTM but left a few nits/follow-ups/etc.
run: | | ||
pushd ffi/examples/read-table | ||
mkdir build | ||
pushd build | ||
cmake .. | ||
make | ||
make test | ||
- name: build and run visit-expression test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this run in parallel with the first (low prio)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, this is sequential.
} | ||
|
||
void free_expression_list(ExpressionItemList list); | ||
void free_expression_item(ExpressionItem ref) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember you said you were battling memory leak - did you valgrind to validate it now? Wonder if we should include valgrind tests somehow (maybe just do fastest/easiest thing for now - like docs that say "I ran valgrind with blah blah
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I considered that. There's some platform stuff that I'd have to do I think. For example, macos doesn't have valgrind, it uses leaks
. Linux has valgrind. I'll give it a shot to run it through valgrind and leaks depending on os.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: Added a linux-only valgrind test. Macos (and i think windows) don't have valgrind, and I didn't think the cross platform test is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea just linux I think would certainly suffice :)
#[cfg(feature = "test-ffi")] | ||
pub mod test_ffi; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also instead of making new feature flag we could just put behind usual test
cfg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this. There's no easy way to do this as far as I know. cbindgen
ignores all code under #[cfg(test)]
.
One idea is to have a separate build.rs
that's exclusively for testing. This can include a test crate using the bulider.
Either way we'll have to rethink how we handle our current ffi tests imo
Also @OussamaSaoudi-db can we update the PR description? There are a number of changes here and I think it would help if we have a PR desc that says something to the effect of a list (1) moves old expression visitor to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good. A few items to address before merge.
@@ -20,7 +20,8 @@ pub struct ArrayData { | |||
} | |||
|
|||
impl ArrayData { | |||
pub fn new(tpe: ArrayType, elements: Vec<Scalar>) -> Self { | |||
pub fn new(tpe: ArrayType, elements: impl IntoIterator<Item = impl Into<Scalar>>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this idiom, and how it gets annoying for passing empty lists (see above, where vec![]
no longer compiles and we have to instead do Vec::<Scalar>::new()
.
I couldn't find a way to infer a "reasonable" type from []
or vec![]
, so the only other option is to define an empty
method. But that's not great, either.
In the end, it may not matter much, because these new-empty cases are almost entirely confined to test code. But on the other hand most expression-using code is actually test code, and there's an argument for making tests easier to write and maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's definitely annoying 😔 I personally want to see macros to easily put together expressions, struct data, etc. Perhaps such a macro could handle this case. but we're a long way from that point.
/// ([`usize`]) the engine wants, and will be passed back to the engine to identify the list in the | ||
/// future. | ||
/// | ||
/// Every expression the kernel visits belongs to some list of "sibling" elements. The schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: "sibling" has always seemed a confusing/unhelpful name to me (which is bad, given that I named it). Maybe it's better termed an "owner" or "parent" list? (as in, the list that owns me -- different from the child list that I own)?
Open to other ideas, and certainly doesn't need to be addressed in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it took me a while to understand it. I think parent
and owner
might help.
Bit of a curiosity question: Any prior work on this style of visitor pattern? I've seen examples of visitor pattern where we simply get back the item. Here we're providing a list to put the item into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not seen this "double trampoline" pattern previously. It is motivated by kernel's particular constraints when it comes to memory management (engine and kernel have separate memory and they don't mix nicely).
@@ -75,6 +75,7 @@ pub mod snapshot; | |||
pub mod table; | |||
pub(crate) mod utils; | |||
|
|||
pub use delta_kernel_derive; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich I wanted to flag this. I had to do this to do this because the column_name
macro (used by column_expr
) uses the macro from delta_kernel_derive
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems ok? But I'll admit I don't know the implications of making that crate public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a separate crate so I think we can tackle this two ways?
- (this approach) we can export it via kernel since it's a dependency of kernel or
- we can just import it as a separate crate into FFI crate right?
seems like better separation of concerns to just to (2) but since kernel will always depend on delta_kernel_derive, I could also be okay with option (1).
curious to hear from @nicklan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems ok? But I'll admit I don't know the implications of making that crate public?
AFAIK the crate is already public, and this is just re-exporting it as a public sub-crate from kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Laying out the 2 options was rly helpful!
It feels like it also makes sense to do 1 from a usability standpoint. Suppose delta-rs wants to create columns using column_expr!
or column_name!
. Without the re-export, they'd get a compilation error coming from the macro, and they'd have to go through a couple layers of macro before finding why it fails. This was my experience using the macros
Posting for some visibility: We only do memory leak testing for linux. This is because macos does not support I think it's sufficient to just test memleaks on linux. |
target_compile_options(visit_expression PRIVATE -Wall -Wextra -Wpedantic -Werror -Wno-strict-prototypes) | ||
|
||
# Get info on the OS and platform | ||
if (${CMAKE_SYSTEM_NAME} MATCHES "Darwin") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zachschuermann just wanted to have a second look. Slightly yucky, but it's the way I found to target linux for valgrind. Note that macos does not support valgrind.
That seems reasonable -- I don't see why leak behavior should be platform-dependent. A leak is a leak is a leak, most likely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! thanks
Rationale for the change
Engines may need to apply expressions that originate in the kernel. This can include filter predicates such as filters for protocol and metadata actions in a log file. Another example is logical to physical transformations that the engine must apply, for instance to materialize partition columns. This PR introduces a visitor framework for kernel expressions to be converted to engine expressions so that engines can apply their own evaluators, predicate pushdowns, and optimizations on these expressions.
Closes: #358
What changes are included in this PR?
5 major changes are made in this PR:
ffi/src/expressions/kernel.rs
is added. This module is responsible for converting kernelExpression
s to the engine's representation of expressions through an ffi-compatible visitor. The visitor supports binary operations, scalars, struct expressions, variadic and unary operators, and columns. All the engine has to do is implement anEngineExpressionVisitor
to support the conversion.EnginePredicate
to kernelExpression
visitor code is moved toffi/src/expressions/engine.rs
. This is done to keep all ffi expression code under theffi/src/expressions
module.SharedExpression
s, which are handles to the kernel'sExpression
type. These are given back to the kernel to specify which expression the engine would like to visit.EngineExpressionVisitor
can be used to construct expressions. This is a C implementation that can be found inffi/examples/visit-expression/
. The example only prints an expression that it receives from the kernel and asserts that it matches an expected output.test-ffi
is added to theffi
crate. This feature flag is used to expose testing functions found inffi/src/test_ffi
. This module is currently used to return a kernel expressionget_testing_kernel_expression
used in the example c implementation.Are these changes tested?
Two tests are added:
ffi/src/test_ffi.rs
, and the expected output is inffi/tests/test-expression-visitor/expected.txt
Are there any user-facing changes?
This PR introduces a public facing API for constructing expressions and visiting them. It also adds
delta_kernel_derive
as a public export ofkernel
crate. This PR does not break existing APIs.